AN INQUIRY INTO THE BENEFITS OF MULTIGAUGE PARALLEL COMPUTATION

Lawrence  Snyder 


Department  of  Computer  Science 
University  of  Washington 
Seattle, Washington 98195


Abstract-A multigauge parallel computer is a machine whose
processor elements can be partitioned into distinct processors
with narrower data paths. ILLIAC IV was a multigauge machine.
This paper addresses the question of whether multigauge
computation is useful. First, the concept is defined. Next, it is
argued that multigauge machines offer truly new computational
facilities, rather than being other architectures in disguise.
Finally, a class of algorithms is identified that can exploit
multigauge architectures, and an example is presented to illustrate
multigauge machine performance improvements.

Introduction 


In  parallel  computation  speed  comes  from  organising  many 
processors to solve a single problem, so it is natural to think
of  accelerating  a  parallel  computer  by  adding  more  processors 
rather  than  by  speeding  up  those  currently  in  use.  But  speeding 
up  the  processor  elements  (PEs)  is  an  effective  way  to  improve 
performance.  For  example,  a  factor  of  two  improvement  in  PE 
speed  yields  a  factor  of  two  improvement  in  instruction  execu¬ 
tions  per  second  and  this  can  often  be  done  with  only  a  modest 
amount  of  extra  hardware;  achieving  the  same  improvement  by 
adding  PEs  requires  at  least  twice  the  hardware.  (Utilizing  the 
performance  improvement  has  its  problems  with  either  solution: 
Faster  PEs  cause  memory  latency  to  have  a  greater  effect  on 
observed  performance,  and  more  PEs  exacerbate  communica¬ 
tion  bottlenecks.)  Clearly,  making  faster  PEs  is  only  a  tactic  in 
the battle for improved parallel computer performance because
there  is  a  limit  to  how  fast  a  sequential  processor  can  get,  and 
the  greater  the  speed  of  a  PE,  the  greater  the  cost  of  improv¬ 
ing  on  it.  Providing  more  PEs  is  the  strategy  that  will  win  in 
the  long  run.  For  any  given  situation,  however,  the  question  is: 
more  PEs  or  faster  PEs? 

One technique with elements of both approaches is to introduce
multiprocessing into the PEs. The technique, called gauge
shifting, exploits the fact that data types come in different sizes
and the smaller ones might be processed concurrently by
partitioning the data path. The first machine capable of gauge
shifting was ILLIAC IV [1]; the 64 64-bit PEs could also be
used as 128 32-bit PEs or as 512 8-bit PEs.¹ Although some
programs were written for ILLIAC IV using the 32-bit gauge
PEs, the machine was apparently never used in the way proposed
here, namely to shift back and forth between different
gauges dynamically.

¹ Whether one thinks of an n processor machine with 64-bit PEs as
becoming a 2n processor machine with 32-bit PEs, or becoming an n
processor machine with dual 32-bit PEs, depends on other aspects of the
architecture as described in the second section.

The purpose of this paper is to inquire into the benefits of
the general idea of gauge shifting as a means of improving
parallel computer performance. The presentation includes a more
precise definition of the concept, the identification of a class of
algorithms capable of exploiting dynamic gauge shifting, and an
investigation as to whether gauge shifting constitutes a
fundamentally different form of computation.

Definitions  and  Background 


In  this  section  the  concepts  alluded  to  in  the  introduction 
are  made  more  precise  and  relevant  related  work  is  cited. 

A multigauge architecture is a sequential computer with a
data path width of B bits, called the wide track machine, which
can be partitioned into k distinct sequential machines, called
narrow track machines, each with a ⌊B/k⌋-bit wide data path.
It is convenient to permit a von Neumann machine to be a trivial
(i.e. k = 1) multigauge machine, and the term dual track will be
used to refer to the case where only one nontrivial value of k is
implemented.

Notationally, B will denote the wide track width, b will denote
the narrow track width, and it is assumed hereafter that
b · k = B. The multigauge machine can be described by listing
the different track widths it supports. Thus the ILLIAC IV PEs
would be (64, 32, 8) multigauge machines.
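The notation can be made concrete with a small sketch. The class name `Multigauge` and its method are hypothetical, introduced only for illustration; the sketch assumes that every supported narrow width divides the wide track width exactly, as the text stipulates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Multigauge:
    """A PE described by the track widths it supports, widest first."""
    widths: tuple  # e.g. (64, 32, 8); widths[0] is the wide track width B

    def partitions(self, b: int) -> int:
        """Return k, the number of b-bit narrow track machines, so that b * k = B."""
        wide = self.widths[0]
        if b not in self.widths:
            raise ValueError(f"gauge {b} is not supported by this machine")
        if wide % b != 0:
            raise ValueError("narrow track width must divide the wide track width")
        return wide // b


# The ILLIAC IV PEs as (64, 32, 8) multigauge machines, 64 PEs in total.
illiac_pe = Multigauge((64, 32, 8))
num_wide_pes = 64
for b in illiac_pe.widths:
    print(b, num_wide_pes * illiac_pe.partitions(b))
```

Run as written, the loop reproduces the counts quoted in the introduction: 64 PEs at 64 bits, 128 at 32 bits, and 512 at 8 bits.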

The instructions executed by the narrow track machines can
form either a single stream, i.e. an SIMD multigauge architecture,
or multiple streams, i.e. an MIMD multigauge architecture,
but there are some pragmatic limits. For example, it seems
unrealistic when b = 1 to postulate MIMD execution, since fetching
separate instructions for each bit of the data path, decoding
them, calculating operand addresses, and fetching disparate bits
from memory is excessive effort for the amount of computation
being performed. So, postulate the MIMD threshold, the number
of bits wide a narrow gauge machine must be before MIMD
execution is "justified". Here the MIMD threshold will be taken
to be eight bits; shifting to gauges narrower than eight bits will
be assumed to be SIMD execution.
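The threshold rule reads naturally as a predicate on the narrow track width. The following is a minimal sketch; the constant and function names are hypothetical, and the value eight is simply the one adopted in the text.

```python
MIMD_THRESHOLD_BITS = 8  # the threshold assumed in the text


def allowed_modes(narrow_width_bits: int) -> tuple:
    """Execution modes considered realistic for a given narrow track width."""
    if narrow_width_bits < 1:
        raise ValueError("track width must be at least one bit")
    if narrow_width_bits >= MIMD_THRESHOLD_BITS:
        return ("SIMD", "MIMD")
    # Below the threshold, per-machine instruction fetch is not justified.
    return ("SIMD",)
```

For example, `allowed_modes(1)` admits only SIMD execution, while `allowed_modes(32)` admits both.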

A multigauge parallel computer is a parallel machine whose
PEs are capable of gauge shifting. There are two ways to implement
this capability. In the first, the machine with wide track PEs and
the machine with narrow track PEs are each instances of the
same architectural family, i.e. if the size of the PEs is ignored,
the narrow track machines appear to be versions of the wide
track architecture scaled up to more PEs. Alternatively, the
architectural relationships do not change as a result of shifting
except that the PEs become small multiprocessors. The former
are referred to as Type A multigauge parallel machines and the
latter as Type B multigauge parallel machines. To illustrate, an
8 x 8 mesh-connected architecture of 64-bit PEs that shifts to a
16 x 16 mesh-connected architecture with 16-bit PEs is a Type
A multigauge parallel computer. Alternatively, if the machine
remains an 8 x 8 mesh but each PE becomes a quad processor,
then this would be a Type B multigauge parallel architecture.
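The mesh example can be sketched as a function from a wide track configuration to its shifted form. The function name `shift_mesh` and the dictionary layout are hypothetical; for Type A the sketch assumes k is a perfect square so that a square mesh stays square, which holds for the k = 4 case in the text.

```python
import math


def shift_mesh(side: int, wide_bits: int, narrow_bits: int, kind: str) -> dict:
    """Describe a side x side mesh of wide PEs after shifting to the narrow gauge."""
    if wide_bits % narrow_bits != 0:
        raise ValueError("narrow width must divide wide width")
    k = wide_bits // narrow_bits
    if kind == "A":
        # Type A: same architectural family, scaled up to k times as many PEs.
        root = math.isqrt(k)
        if root * root != k:
            raise ValueError("this sketch only handles square k for Type A meshes")
        return {"mesh": (side * root, side * root),
                "pe_bits": narrow_bits, "processors_per_node": 1}
    if kind == "B":
        # Type B: topology unchanged, each PE becomes a k-way multiprocessor.
        return {"mesh": (side, side),
                "pe_bits": narrow_bits, "processors_per_node": k}
    raise ValueError("kind must be 'A' or 'B'")


type_a = shift_mesh(8, 64, 16, "A")
type_b = shift_mesh(8, 64, 16, "B")
```

Both results describe the same 256 narrow track processors; they differ only in whether the interconnection sees 256 nodes or 64 quad-processor nodes.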

The multigauge concepts discussed here are reminiscent of
several previous studies. The "Dynamic Architecture" of Kartashev
and Kartashev [2] is based on connecting together basic
narrow track (e.g. 16-bit) computers on a bus to achieve a wider
word width. The wide track machine, called a dynamic computer
group, can be simultaneously shifted into different gauges as long
as the narrow track machines each have a data path width that is
a multiple of 16 bits. The concept of switching between SIMD and MIMD
execution modes has been most fully developed in the Partitioned
Array SIMD-MIMD (PASM) Computer of H. J. Siegel
and his colleagues [3]. PASM uses the same fixed size processor
elements in both modes. The Very Long Instruction Word
Architectures of Fisher [4] utilize several independent, fixed gauge
ALUs that are neither split nor joined. The Content Addressable
Array Processor (CAAP) of Weems et al. [5] couples several
narrow gauge machines with a wide gauge machine; the machines
are distinct rather than being restructurings of the same
hardware. Similarities with other architectures undoubtedly remain
to be explored.

To close this section, notice that certain other implementations
besides ILLIAC IV provide some degree of gauge shifting.
For example, the Cyber 205 [6] can partition the 64-bit data
path into two 32-bit data paths.

An  Analysis  of  Benefits 


Multigauge computers appear to offer a benefit over fixed
gauge machines on computations involving small size data types,
since k narrow gauge machines provide k-fold parallelism. But
because the multigauge idea uses essentially the same hardware
with only modest enhancements, one wonders if perhaps the
speedup is only an illusion. This concern is further strengthened
by the observation that k 1-bit machines each performing
an AND are essentially equivalent to a k-bit fixed gauge machine
executing a bit-wise AND. Therefore, it is necessary to argue
that multigauge computation is a fundamentally different
phenomenon, and we will. In addition, we will identify the exact
source of the improved performance.

Before beginning, we make some preliminary observations.
First, it is appropriate to limit our arguments to multigauge
machines as opposed to parallel multigauge machines, since we are
interested in the multigauge phenomenon alone, and the arguments
either extend directly to the parallel case or become more
complicated due to interactions with other parts of the parallel
architecture. Second, we will assume that multigauge machines
have comparable performance to like gauge sequential computers.
This is a significant assumption because there is somewhat
greater complexity with a multigauge machine, and so we are
assuming that it is completely transparent during wide track
execution. (We also assume, though perhaps somewhat less
realistically, that the narrow gauge instructions run at the same
rate.) Another reason why comparable performance is a significant
assumption is that there are many strategies for speeding
up sequential machines, and these may not be compatible with
the multigauge approach (or each other, for that matter). So
adopting a multigauge design may preclude other optimizations.
Still, comparable performance is a plausible assumption to get
us started; if fundamental benefits can be identified, the detailed
design needed to resolve these other issues will be justified.

To understand how performance gains might accrue from
gauge shifting, compare the narrow track machines executing
in SIMD mode with a standard sequential machine. For the
comparison to be interesting, suppose that instructions exist for
the sequential machine so all data types of small size can be
treated like the bit-wise ANDs mentioned above. Specifically, a
standard kb-bit sequential machine having (k - 1)b unused bits
when computing on b-bit data has its instruction set extended
to support k b-bit operations elementwise within a word. For
example, in addition to logicals one might have instructions to
do two half word ADDs, etc. Such an extended sequential machine
would not be equivalent to a multigauge machine executing
even in SIMD mode because, although there is one instruction
stream in force in both machines and multiple data values being
manipulated in both, there is but one address for each operand
group of the extended sequential machine. Data values must be
packed together in a word to achieve k-way parallelism. There
is no such restriction for the multigauge machine.²

For the two machines to have equal performance requires
that every algorithm using differently addressed data streams
of narrow width data be convertible into a packed form that
can be referenced by a single address stream. This seems to be
extremely unlikely. The point of the comparison is twofold: the
bitwise AND is really a special case rather than being a good
example of the multigauge idea, and although construction of a
multigauge machine will engender certain costs associated with
supporting multiple operand fetching, the feature has apparent
benefit even in the SIMD case.
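The distinction between packed operands and independently addressed operands can be illustrated with a short sketch. Both functions are hypothetical models, not descriptions of any real instruction set: the first assumes a 32-bit word holding two 16-bit half words, the second models memory as a list of 16-bit values with one address stream per narrow track machine.

```python
MASK16 = 0xFFFF


def packed_add16(x: int, y: int) -> int:
    """Extended sequential machine: two half word ADDs inside one 32-bit word.

    Both operand pairs must already sit packed at a single word address;
    carries are not allowed to cross from the low lane into the high lane.
    """
    lo = ((x & MASK16) + (y & MASK16)) & MASK16
    hi = (((x >> 16) & MASK16) + ((y >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


def multigauge_add16(memory, addrs_x, addrs_y):
    """SIMD multigauge machine: one ADD instruction, but each narrow track
    machine fetches its operands through its own pair of addresses."""
    return [(memory[ax] + memory[ay]) & MASK16
            for ax, ay in zip(addrs_x, addrs_y)]


memory = [10, 20, 30, 40]
scattered = multigauge_add16(memory, addrs_x=[0, 3], addrs_y=[2, 1])
packed = packed_add16(0x0001FFFF, 0x00010001)
```

The packed form only helps when the data happen to be laid out in adjacent lanes; the multigauge form tolerates operands scattered through memory, which is the capability the argument above turns on.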

Having focussed on multiple operand fetching, we now address
the benefits of multiple instruction fetching. (The argument
amounts to a defense that MIMD computation is more
powerful than SIMD computation.) Postulate a sequential machine
capable of packing several operator/operand specifications
together in a single instruction word. Such a machine, though
still with only a single program counter, would be able to execute
several distinct programs provided that a particular condition
could be enforced on the execution sequence, namely, that
they remain "in unison". In particular, let

    I = I_1, I_2, ..., I_n
    J = J_1, J_2, ..., J_n
    ...
    K = K_1, K_2, ..., K_n

be programs. As long as these are straight line code, the instructions
and their operands can be packed together in instruction
words,

    < I_1, J_1, ..., K_1 >
    < I_2, J_2, ..., K_2 >
    ...
    < I_n, J_n, ..., K_n >

and be executed by a machine with a single program counter. If
there is a conditional branch, say in J_1, with two possible target
instructions, then we need to provide another sequence of packed
instructions so that the branching can be provided for and still
remain "in unison".

² It may be that once the data types get too narrow, addressing
restrictions must apply for the same reasons motivating the MIMD
threshold. If the limitation is to a single data stream, the machines
could be equivalent.

Obviously, storing the whole execution sequence is unrealistic,
so we concentrate on storing short segments that represent
the instructions that could be executing concurrently. Consider
programs composed of short blocks which each test a value,
change it and then jump to one of two different blocks depending
on the outcome of the test. Each program will jump around
to different locations in an unpredictable order. When we consider
the programs together we see that any given instruction
could be executing with any combination of instructions from
the other programs. Thus to generate a program that can be
executed with a single instruction counter requires that essentially
all tuples of instructions, one from each program, must be
provided for.
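The two packing regimes can be sketched directly. The function names are hypothetical, and instructions are represented only by labels; the sketch assumes all programs have the same length.

```python
from itertools import product


def pack_straight_line(programs):
    """Straight line code: the j-th packed word holds the j-th instruction
    of every program, so the programs trivially stay in unison."""
    return list(zip(*programs))


def pack_unpredictable_branches(programs):
    """Unpredictable branching: any instruction may coincide with any
    instruction of the other programs, so one packed word is provided
    for every tuple, one instruction from each program."""
    return list(product(*programs))


programs = [["I1", "I2", "I3"],
            ["J1", "J2", "J3"],
            ["K1", "K2", "K3"]]
in_unison = pack_straight_line(programs)
all_tuples = pack_unpredictable_branches(programs)
```

With three programs of three instructions each, the straight line packing needs 3 words while the tuple packing needs 27, which is the growth the space comparison relies on.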

Returning now to the question of multigauge computation
being faster than an equivalent sequential computation, we note
that programs of the type just discussed can be stored in O(kn)
space on a multigauge machine but will require O(n^k) space on a
sequential machine. This disparity is too great to be the basis of
a fair comparison, as can be seen by instantiating the functions
for realistic size values such as k = 4 and n = 100. Assuming the
sequential machine is limited to a comparable amount of space,
it must cease to exploit packed instructions and thus be reduced
to executing the programs (essentially) separately. The resulting
longer execution times imply the existence of a fundamental
performance improvement with multigauge computation.
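To make the disparity concrete, the two bounds can be instantiated with illustrative values. The choice k = 4 and n = 100 is an assumption made here for the arithmetic, and the constants hidden by the O notation are ignored.

```latex
\[
  \underbrace{k \cdot n}_{\text{multigauge}} \;=\; 4 \cdot 100 \;=\; 400
  \qquad\text{versus}\qquad
  \underbrace{n^{k}}_{\text{packed sequential}} \;=\; 100^{4} \;=\; 10^{8}.
\]
```

Under these assumptions the packed sequential program is larger by a factor of 250,000.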

The conclusions from the preceding discussion are that multigauge
computation is fundamentally different from sequential
computation and that potential performance improvements exist.
Fetching multiple operands and multiple instructions, though
complicating the machine design, has been shown to be a
source of power. Whether the benefits can actually be realized in
a physical design is an interesting and challenging open problem.

Since it would seem that a multigauge architecture will
essentially be many program counters, control units, instruction
decoders, etc., sharing a data path and a memory, neither of
which is a very scarce resource, it is evidently not the case that
gauge shifting is justified on purely economic grounds. It is,
therefore, appropriate to close this section with a brief
philosophical discussion of additional benefits of multigauge (parallel)
computation. One advantage is that multigauge machines
neutralize a rather pointless argument about the merits of "coarse
grain" versus "fine grain" computation; these machines can be
either, as appropriate. More importantly, multigauge architectures
respond to the fact that certain problems display several
types of computational needs: voluminous but rather direct
data manipulation followed by much more complex, sophisticated
processing. (A more detailed description is given in the
next section.) The key point is that multigauge machines can do
both with respect to the same memory. It is not that memory
is expensive, but rather that data occupancy is. Once in memory,
data should be processed where it resides rather than being
moved about unchanged, thus introducing overhead. This
aspect is extremely important for nonshared memory architectures.
Finally, there are the esthetics of being able to describe
directly different gauge computations rather than encoding one
in another. Of course, no matter how elegant, a machine doesn't
count for much unless it is useful for some important problems;
so we consider algorithms that can exploit gauge shifting.

Two  Tier  Algorithms 

Although we have concentrated on the architectural issues of
gauge shifting, the motivation for studying the phenomenon, as
indicated in the last section, is to support the execution of certain
kinds of algorithms, the general class of which we call multitier
algorithms.³ The simplest members of the algorithmic class
are two tier algorithms, which have the property that there is an
enormous amount of simple data processing on small size data
items, followed by more complicated processing on more complex
data structures. Problems requiring two tier algorithms for
their solution arise in many application areas such as artificial
intelligence, data bases and image processing. For example, in
image processing the first tier would involve pixel level processing
where regions of two images might be correlated to register
the two pictures. The higher tier processing focusses on such
activities as motion detection.

Two tier algorithms are ideal for execution on a multigauge
computer. The narrow gauge processing of the first tier can
benefit from the greater parallelism, while the wide track mode
supports the more complex processing of the higher tier. It
would be unrealistic to present a two tier algorithm for a true
application since the higher tier would be complex beyond what
is necessary for illustration. However, we can present an algorithm
to solve a simple puzzle, Word Find, which illustrates the
principal concepts of switching between different gauges.

Word Find is a common puzzle in which the solver is presented
with an m x m array of letters, A, and a word list, W,
of size r x s, i.e. there are r words, the longest of which is s
letters. The object of the puzzle is to locate the words of the list
in the array of letters as consecutive positions in a row, column
or diagonal. For example,

FEE 
FIE 
FO 
FUM 

Finding the words will be the first tier problem. The higher
tier processing then tests whether every word of the list is found
in the array exactly once, i.e. whether all the words of the list
exist in the array without duplicates.

The Word Find problem will be solved on a Type A (32, 8)
multigauge parallel machine of the CHiP architecture [8]. Recall
that Type A machines display the same architecture in both
wide and narrow tracks, so in the present case both gauges are
assumed to be configurable. We will use an eight way mesh
interconnection for the narrow gauge and a binary tree interconnection
for the wide gauge. Thus, each tree node will correspond
to a 2 x 2 mesh subarray of the narrow gauge.

To simplify the presentation, we make some assumptions.
First, we assume that the letter array A is already loaded into
the processor array, one letter per narrow gauge PE. Second,
each narrow gauge PE has access to the r x s word list W,
which means there is at least one copy per wide track machine;
there may be one per narrow track depending on how memory
reference conflicts are handled. Third, we simplify matters by
only searching for horizontal matches; the other cases are trivial
extensions. Finally, we ignore "edge effects", i.e. we do not
worry about the case where the right column PEs have no right
neighbor.

³ These algorithms have also been called hierarchical [7].

The r x s word list W has a special form. The words are right
justified, padded (on the left) with blanks (␢), and augmented by a
blank column (column 0) on the left and a blank row (row r + 1) on the
bottom. The words are lexicographically sorted by right-most
position, i.e. words ending in 'a' come first. (See Figure 1.)
Also, if W[i,j] is a nonblank character and W[i-1,j] = W[i,j] for this
and all larger values of j, then W[i,j] is replaced by a ditto mark
("). Finally, a bit vector find[1:r], initially zero, is local to each
narrow gauge PE and is assigned 1 in the i-th position if this PE
is the first character of a (horizontal) instance of the i-th word.

The matching part of the algorithm uses s iterations to locate
which of the r s-length words match. During the j-th iteration
a PE reads a value (p) from the east indicating either that no
word matches in the last j - 1 positions (p = 0), or giving the
index of the (first) word in the list that matches in the last j - 1
positions (p ≠ 0). If the match had failed or fails this time, the
p = 0 value is sent west. If the match continues, a p ≠ 0 is sent
west. If it happens that a match also succeeds, this is recorded
in the find vector. Finally, the match that had been found could
fail, but because of the ditto marks (indicating other words with
the same suffix) the index could be moved to a subsequent word.
We give the text of the narrow gauge program as if for the
Poker parallel programming environment [9].

code match;

/* The information global to this process is:

   character A                      The element of the word find
                                    letter array stored in this
                                    narrow track PE.

   character array W[1..r+1, 0..s]  The word list, padded and
                                    right justified.

   integer r, s                     The word list size, i.e. number
                                    of words and maximum number
                                    of letters.
*/

ports East, West;

begin integer i, j, p; Boolean array find[1..r];

   /* locate word matching A in last character */
   p := 0;   /* initialize to "none found" */
   for i := 1 to r do
      if W[i, s] = A then { p := i;
                            if W[i, s-1] = '␢' then find[i] := 1;
                            go to L1 };

L1: West <- p;   /* send index to neighbor */

   /* general matching - next to last through first character */
   for j := s - 1 to 1 step -1 do
      begin p <- East;   /* receive index from neighbor */
         if p ≠ 0 then
            begin
               /* word p starts at this PE: record it */
               if W[p, j] = A ∧ W[p, j-1] = '␢' then find[p] := 1;
               /* on mismatch, advance to the next word sharing the suffix */
           L2: if W[p, j] ≠ A ∧ W[p+1, j+1] = '"'
                  then { p := p + 1; go to L2 };
               if W[p, j] ≠ A then p := 0
            end;
         West <- p   /* send index to neighbor */
      end
end;
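As a check on what the narrow gauge phase is meant to produce, the find vectors can be specified by a direct, sequential reference computation. This is a sketch of the intended result only, not of the systolic matching scheme or the ditto mark encoding; the function name, the 0-based indexing, and the sample grid are assumptions introduced for illustration, and only horizontal matches are considered, as in the text.

```python
def find_vectors(grid, words):
    """Return find[row][col][i] = 1 when words[i] occurs horizontally in
    grid[row] with its first character at column col, else 0."""
    return [[[int(row.startswith(word, col)) for word in words]
             for col in range(len(row))]
            for row in grid]


words = ["FEE", "FIE", "FO", "FUM"]
grid = ["FEEX",
        "XFIE",
        "FOFO",   # "FO" appears twice in this row: a duplicate
        "FUMX"]
find = find_vectors(grid, words)
```

Each inner list plays the role of one PE's local find vector; the duplicate "FO" in the third row is exactly the kind of collision the wide track phase has to detect.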

At  the  completion  of  the  narrow  gauge  programs,  the  machine 
changes  state  and  begins  to  execute  the  wide  track  program. 

The wide track program uses a binary tree interconnection of
PEs. Each node refers to the find vectors of its four constituent
PEs, treating the values as words and using logical bitwise
operations on them. (The careful reader will recognize that our
use of bitwise ANDs here is only a coincidence and has nothing
to do with the discussion in the third section.) The goal is to
recognize if all and only the words of the word list appear in
the letter array, so each node "merges" its find vectors; if it is
a leaf it passes the result to its parent, and if it is a nonleaf it
"merges" in the results from its two children before passing the
result to its parent. To perform the "merge" operation, we use
a function merge that checks for and records any collisions and
then unions the bit sequences together. At the end, the outcome
of the collision tests is passed up the tree.

The code for the wide track program of a nonleaf node is given
below. Leaf programs would not have the starred lines.


code combine;

/* The information global to this process is:

   Boolean array PE1find,     Find arrays
                 PE2find,     from the narrow
                 PE3find,     gauge PEs
                 PE4find

   integer r                  Number of bits in finds
*/

ports leftchild, rightchild, parent;

begin
   integer i, ans, lans, rans;
   logical temp, temp1, temp2;
   logical array PE1bits[1..⌈r/32⌉],
                 PE2bits[1..⌈r/32⌉],
                 PE3bits[1..⌈r/32⌉],
                 PE4bits[1..⌈r/32⌉];

   /* make data value correspondence */
   equivalence (PE1find, PE1bits), (PE2find, PE2bits),
               (PE3find, PE3bits), (PE4find, PE4bits);

   /* merge: record a collision in ans, return the union */
   function C(a, b); logical a, b;
      { if (a ∧ b) ≠ 0 then ans := 0; C := a ∨ b };

   ans := 1;
   for i := 1 to ⌈r/32⌉ do
      begin
         temp1 := C(PE1bits[i], PE2bits[i]);
         temp2 := C(PE3bits[i], PE4bits[i]);
         temp := C(temp1, temp2);
         temp1 <- leftchild; temp2 <- rightchild;   /* * */
         temp := C(temp, temp1);                    /* * */
         temp := C(temp, temp2);                    /* * */
         parent <- temp
      end;

   lans <- leftchild;                               /* * */
   rans <- rightchild;                              /* * */
   ans := ans × lans × rans;                        /* * */
   parent <- ans
end


The  "parent"  of  the  root,  presumably  the  controller,  receives  the 
results. 
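The merge step and the final test can be sketched with Python integers standing in for the bit vectors. The function names are hypothetical, the vectors are folded sequentially rather than up a tree (the tree performs the same merges in logarithmic depth), and the final comparison against the all-ones vector is an assumption about what the controller does with the results it receives.

```python
def merge(a: int, b: int):
    """Union two find vectors and report whether they were disjoint.

    A shared 1 bit means the same word was found in two places.
    """
    return a | b, (a & b) == 0


def all_found_exactly_once(find_vectors, r: int) -> bool:
    """True when the r words are each found exactly once across all PEs."""
    union, ok = 0, True
    for vector in find_vectors:
        union, disjoint = merge(union, vector)
        ok = ok and disjoint
    return ok and union == (1 << r) - 1


complete = all_found_exactly_once([0b0001, 0b0010, 0b0100, 0b1000], r=4)
duplicate = all_found_exactly_once([0b0001, 0b0001, 0b0110, 0b1000], r=4)
missing = all_found_exactly_once([0b0001, 0b0010, 0b0100, 0b0000], r=4)
```

One limitation of this encoding, shared by the find vectors themselves, is that a collision is only visible between different PEs; that suffices here because a given PE can be the first character of at most one horizontal instance of a given word.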

For the algorithm analysis, we notice that as long as the W
array bounds remain within the "single precision" range of the
narrow track PEs, there is essentially full speedup for the narrow
track phase of computation. The parallelism in the wide track
phase is restricted to that provided by multiple PEs rather than
gauge shifting. Thus, we have for some c_1 > 0 and c_2 > 0,

    c_1 (r + s) m^2 + c_2 r log m

steps. If we suppose that B = 32 and k = 4, then we achieve
essentially the whole factor of four speedup on the work represented
in the first term. Since this is the dominant term in the
computation, the benefit applies to the whole algorithm. Moreover,
if the problem size grows in terms of m the benefits persist.
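The claim about the dominant term can be explored numerically. The sketch below rests on several assumptions that are not in the source: the step count is read as c_1 (r + s) m^2 + c_2 r log m for a fixed gauge machine, gauge shifting is modeled as dividing only the first term by k, the logarithm is taken base 2, and both constants default to 1.

```python
import math


def steps(r, s, m, k=1, c1=1.0, c2=1.0):
    """Modeled step count; k = 1 corresponds to a fixed gauge machine."""
    narrow_phase = c1 * (r + s) * m ** 2 / k   # benefits from gauge shifting
    wide_phase = c2 * r * math.log2(m)         # does not
    return narrow_phase + wide_phase


def speedup(r, s, m, k, c1=1.0, c2=1.0):
    """Ratio of fixed gauge steps to multigauge steps under the model."""
    return steps(r, s, m, 1, c1, c2) / steps(r, s, m, k, c1, c2)


small = speedup(r=4, s=3, m=8, k=4)
large = speedup(r=4, s=3, m=64, k=4)
```

Under this model the overall speedup stays below k but approaches it as m grows, which is the sense in which the benefits persist.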

Conclusions


The  goal  of  this  paper  has  been  to  inquire  into  the  benefits 
of  gauge  shifting.  Towards  this  goal  we  have  defined  multigauge 
architectures  and  related  concepts.  We  have  argued  that  multi¬ 
gauge  computation  represents  a  fundamentally  different  kind  of 
computation, not simply sequential computation in disguise.
Finally, we have identified a class of algorithms, two tier
algorithms, that can exploit gauge shifting.

The benefits, analyzed in the abstract, seem to be substantial,
suggesting the worth of a design and implementation effort
to identify and quantify the problems.